Unlocking the potential: T1-weighted MRI as a powerful predictor of levodopa response in Parkinson’s disease

Background
The efficacy of levodopa, the most crucial metric for Parkinson’s disease diagnosis and treatment, is traditionally gauged through the levodopa challenge test, for which no predictive model exists. This study aims to probe the power of T1-weighted MRI, the most accessible imaging modality, to predict levodopa response.

Methods
This retrospective study used two datasets: the Parkinson’s Progression Markers Initiative (PPMI) dataset (219 records) and an external clinical dataset from Ruijin Hospital (217 records). A novel feature extraction method using MedicalNet, a pre-trained deep learning network, was applied along with three previous approaches. Three machine learning models were trained and tested on the PPMI dataset using clinical features, imaging features, and their union set, with the area under the curve (AUC) as the metric. The most significant brain regions were visualized. The trained models were further evaluated on the external clinical dataset. A paired one-tailed t-test was performed between the clinical and union sets; statistical significance was set at p < 0.001.

Results
For 46 test set records (mean age, 62 ± 9 years, 28 men), MedicalNet-extracted features demonstrated a consistent improvement in all three machine learning models (SVM 0.83 ± 0.01 versus 0.73 ± 0.01, XgBoost 0.80 ± 0.04 versus 0.74 ± 0.02, MLP 0.80 ± 0.03 versus 0.70 ± 0.07, p < 0.001). Both feature sets were validated on the clinical dataset using SVM, where MedicalNet features alone achieved an AUC of 0.64 ± 0.03. Key responsible brain regions were visualized.

Conclusion
The T1-weighted MRI features were more robust and generalizable than the clinical features in prediction; their combination provided the best results. T1-weighted MRI provided insights on specific regions responsible for levodopa response prediction.

Critical relevance statement
This study demonstrated that T1w MRI features extracted by a deep learning model have the potential to predict the levodopa response of PD patients and are more robust than widely used clinical information, which might help in determining treatment strategy.

Key Points
This study investigated the predictive value of T1w features for levodopa response.
MedicalNet extractor outperformed all other previously published methods with key region visualization.
T1w features are more effective than clinical information in levodopa response prediction.


Introduction
Parkinson's disease (PD) is a neurodegenerative disorder with a growing prevalence [1]. Its array of symptoms, including tremors, rigidity, bradykinesia, and postural instability, significantly impairs patients' quality of life [1]. Levodopa, a dopamine precursor, is the most widely used treatment [1][2][3]. Clinicians often employ the levodopa challenge test (LCT), as its outcomes are crucial for making diagnoses and guiding treatment strategies, particularly decisions on deep brain stimulation [3]. A predictive model for levodopa response could not only help clinicians determine treatment strategies [4] but also provide insights into potential pathophysiological mechanisms.
T1-weighted MRI is a widely available imaging technique that offers high-resolution brain images. While extensively used in clinical routine for diagnosing and differentiating PD [5][6][7][8][9][10] and predicting conversion from mild cognitive impairment to dementia [11], its potential for predicting levodopa response has been underexplored. Using T1-weighted MRI, Ballarini et al [12] extracted age-corrected gray matter intensity from voxels that discriminated between good and poor responders to predict LCT outcomes. Xie et al [13] constructed a morphological brain graph network to derive individual-level network metrics for LCT result prediction. Furthermore, the PREDISTIM Study Group [4] utilized texture features from 16 subcortical regions of interest (ROIs) to construct feature vectors for each participant to predict LCT results.
Although these studies demonstrated the potential of T1-weighted MRI in levodopa response prediction, they either lacked adequate test sets and had limited sample sizes, or did not assess the predictive ability of imaging features separately, leaving the underlying potential of T1-weighted MRI in levodopa response prediction unclear. Convolutional neural networks have demonstrated efficacy in brain MRI prediction tasks, including PD diagnosis [14,15], but have not been utilized in levodopa response prediction. Therefore, the role of T1-weighted MRI in levodopa response should be further evaluated through a more convincing predictive model.
In this study, we aimed to leverage the Parkinson's Progression Markers Initiative (PPMI) dataset and an external clinical dataset to evaluate the predictive potential of T1-weighted MRI for levodopa response by comparing classification performance with and without imaging features, and to identify the underlying brain regions.

Data sources
The whole PPMI dataset was randomly split into training and test sets with a ratio of 8:2, ensuring that records from the same participant were in the same set, resulting in 173 and 46 records for the training and test sets, respectively.
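A participant-level split of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `grouped_split`, the seed, and the toy participant IDs are all hypothetical, and the exact procedure used to reach 173 and 46 records is not described in the text.

```python
import random
from collections import defaultdict

def grouped_split(participant_ids, test_fraction=0.2, seed=0):
    """Split record indices so that each participant falls in exactly one set."""
    by_participant = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        by_participant[pid].append(idx)

    participants = sorted(by_participant)      # deterministic order before shuffling
    random.Random(seed).shuffle(participants)

    target = round(test_fraction * len(participant_ids))
    train_idx, test_idx = [], []
    for pid in participants:
        # Fill the test set participant by participant until the target size is reached.
        bucket = test_idx if len(test_idx) < target else train_idx
        bucket.extend(by_participant[pid])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy data: 10 records from 6 participants.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6", "p6"]
train_idx, test_idx = grouped_split(ids)
```

Because whole participants are assigned at once, the realized ratio can deviate slightly from 8:2, which is the price of keeping repeated records from one person out of both sets.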
The performance of the resulting models on real-world samples was validated using an external clinical dataset with 217 records from Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, collected between 2017 and 2022. All included participants underwent standard LCT. Notably, these records were collected retrospectively from patients who were candidates for deep brain stimulation surgery, which might bias the dataset distribution toward longer disease duration, higher LEDD and MDS-UPDRS III scores, and a higher proportion of "good" responders (Table 1 and Fig. 1).
T1-weighted MRI scans from PPMI were acquired using 1.5-T (Philips) or 3-T (Siemens) scanners with an isotropic resolution of 1 mm, whereas those from Ruijin Hospital were isotropically acquired using 1.5-T or 3-T scanners (GE) with a resolution of 1 mm to 2 mm.

Feature extraction
Four feature extraction methods were evaluated: three from published research and one proposed in this study. Details of the former methods are provided in the Supplementary Materials. In brief, the first method, following Ballarini et al [12], extracted age-corrected regional gray matter intensity from CAT12-pre-processed images, after which principal component analysis (PCA) was used to select the first 50 principal components as features. The second method, proposed by the PREDISTIM Study Group and Chakraborty et al [4,5], used subcortical ROI textures as PD biomarkers: texture features of 16 subcortical ROIs, encompassing the caudate, putamen, thalamus, GPi, GPe, STN, SN, and RN, were extracted from ANTs-pre-processed images using PyRadiomics (https://pyradiomics.readthedocs.io/en/latest/), and highly correlated features were removed. The third method, following Xie et al [13], constructed a morphological graph using Kullback-Leibler and Jensen-Shannon divergence; the graph metrics of the individual networks were calculated as features.
To enhance the utility of T1-weighted MRI data, we proposed a feature extraction method based on MedicalNet, a pre-trained ResNet-based deep model tailored for medical images [24]. We replaced the layers originally used for segmentation with a max-pooling layer (kernel size = 8, stride = 8, padding = 0) and a flattening layer. The pre-trained model was frozen and treated as a pure feature extractor. ANTs-pre-processed T1-weighted images (193 × 229 × 193 voxels) were input into the model, and the output vector served as the feature vector for each sample.
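The length of the resulting feature vector can be checked with simple shape arithmetic. The sketch below rests on assumptions that the text does not state: a ResNet-style backbone whose stem and first strided stage give an overall stride of 8 (later stages dilated rather than strided), and 512 channels in the last convolutional stage. Only the input size and the max-pooling parameters come from the text.

```python
def conv_out(size, kernel, stride, padding, dilation=1):
    """Output length of a convolution or pooling layer along one axis (floor mode)."""
    effective = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective) // stride + 1

def backbone_out(size):
    """Assumed ResNet-style stem and stages with an overall stride of 8."""
    size = conv_out(size, kernel=7, stride=2, padding=3)   # stem convolution (assumed)
    size = conv_out(size, kernel=3, stride=2, padding=1)   # stem max-pooling (assumed)
    size = conv_out(size, kernel=3, stride=2, padding=1)   # first strided stage (assumed)
    return size                                            # later stages assumed dilated, stride 1

input_shape = (193, 229, 193)        # from the text
channels = 512                       # assumption: width of the last convolutional stage

feature_map = tuple(backbone_out(s) for s in input_shape)
# Max-pooling parameters from the text: kernel size 8, stride 8, padding 0.
pooled = tuple(conv_out(s, kernel=8, stride=8, padding=0) for s in feature_map)

n_features = channels
for s in pooled:
    n_features *= s
print(feature_map, pooled, n_features)
```

Under these assumptions the pooled map is 3 × 3 × 3 per channel, so the flattened vector has 512 × 27 = 13,824 entries, which agrees with the feature count reported for MedicalNet in the results.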
After sequential feature selection, GradCAM [25] was employed to visualize the retained features. The selected features were mapped back to their coordinates as corresponding gradients in the flattening layer. Excluded features were assigned gradients of -0.001. A saliency map was generated for the last convolution layer and up-sampled to visualize the contributing ROIs in the image.
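For reference, the standard Grad-CAM formulation, extended to three spatial dimensions, is given below. Here $A^{k}$ denotes the $k$-th channel of the last convolutional layer's activation, $Z$ the number of voxels in that channel, and $y$ the score being explained; these symbols are introduced only for illustration. In the variant described above, the gradients are assigned from the feature selection result rather than backpropagated from a class score, and how the assigned values pass back through the max-pooling layer is not specified in the text.

```latex
\alpha_k = \frac{1}{Z}\sum_{i,j,l}\frac{\partial y}{\partial A^{k}_{ijl}},
\qquad
L_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k}\alpha_k A^{k}\right)
```

The map $L_{\text{Grad-CAM}}$ has the spatial size of the last convolutional layer and is up-sampled to the input resolution for display.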

Feature selection
To refine the feature sets, given their potential redundancy and noise, a feature selection step was necessary for effective classification. Minimum Redundancy-Maximum Relevance (mRMR), least absolute shrinkage and selection operator (LASSO), and recursive feature elimination (RFE) were applied sequentially to the original feature sets. mRMR, based on mutual information, selects features with high relevance to the target and low redundancy [26]. LASSO, based on L1 regularization, compresses unimportant features to zero to achieve feature selection [27]. RFE, based on backward elimination, recursively removes the least important features until the specified number of features is reached.
Because MRI data generate a large number of features, we applied these methods sequentially to the training-set features from each of the four extraction methods to eliminate irrelevant and redundant features. LASSO and RFE used 5-fold cross-validation to determine optimal hyperparameters. For mRMR, the top 50 features, ranked across the feature sets, were selected for the next step. For LASSO, the optimal regularization parameter α* was used to fit the model on the entire training set, and the features with non-zero coefficients were selected. For RFE, an L2-regularized logistic regression model was used as the estimator. The entire feature selection process was repeated 10 times to generate a more robust feature set. As a result, the feature set from each extraction method was reduced separately.
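The first two selection criteria can be written compactly. The forms below are the commonly used ones and are given as an assumption, since the text does not state which mRMR variant or which loss scaling was used: $S$ is the set of already selected features, $I(\cdot\,;\cdot)$ is mutual information, $y$ is the label, $X$ is the $n \times p$ feature matrix, and $\alpha$ is the regularization parameter.

```latex
% Greedy mRMR step (difference form): relevance minus mean redundancy
f^{*} = \arg\max_{f \notin S}\left[\, I(f; y) - \frac{1}{|S|}\sum_{s \in S} I(f; s) \right]

% LASSO objective; features with \hat{\beta}_j \neq 0 are retained
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^{2} + \alpha \lVert \beta \rVert_1
```

Larger values of $\alpha$ drive more coefficients exactly to zero, which is why cross-validating $\alpha$ also fixes the number of features passed on to RFE.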

Machine-learning models
Machine learning models were trained on a training set using 5-fold cross-validation and tested on a test set to predict the category of the LCT results (good/bad responders). An ablation study was conducted to assess the contribution of T1-weighted MRI data. This involved comparing the classification performance among three feature sets under the same settings: an imaging set, containing the features extracted by each of the four methods in turn; a clinical set, encompassing demographic and clinical information including age, sex, disease duration, LEDD, and MDS-UPDRS III OFF; and a union set that combined the imaging and clinical sets. A MinMaxScaler was fitted on the training set features and then used to scale both the training and test set features.
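The scaling step can be illustrated in plain Python. This is a sketch of min-max scaling fitted on training data only; the helper names and the numeric values are hypothetical, and the study itself refers to a MinMaxScaler rather than hand-written code.

```python
def fit_min_max(train_rows):
    """Learn per-feature minimum and range from the training rows only."""
    columns = list(zip(*train_rows))
    mins = [min(c) for c in columns]
    ranges = [(max(c) - min(c)) or 1.0 for c in columns]   # guard against constant features
    return mins, ranges

def apply_min_max(rows, mins, ranges):
    """Scale rows with statistics learned elsewhere (no refitting)."""
    return [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)] for row in rows]

# Hypothetical values: two features per record.
train = [[60.0, 2.0], [70.0, 6.0], [65.0, 4.0]]
test = [[75.0, 3.0]]

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
test_scaled = apply_min_max(test, mins, ranges)
```

Test values may fall outside the interval [0, 1], as the first feature does here; this is expected when the scaler never sees the test data, and it avoids leaking test-set statistics into training.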
Optimal hyperparameters for each model were determined through 5-fold cross-validation performed on the training set. The specific model was then trained on the entire training set with the optimal hyperparameters and used to predict LCT results for the test set. Repeated experiments were performed to eliminate random effects.
Our study employed three machine learning models (SVM, XgBoost, and MLP), resulting in nine trained and tested models.

Model performance evaluation
To assess model performance, we used the micro-averaged area under the receiver operating characteristic curve (AUC) as the primary metric. For each feature extraction method and machine learning model, we calculated three AUCs on the test set, one for each of the three feature sets. A paired one-tailed t-test between the clinical and union sets was used to evaluate whether the difference between the clinical and union models was statistically significant.
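Both quantities can be sketched in a few lines of plain Python. The example computes a binary AUC through its rank interpretation and the statistic of a paired t-test; it does not reproduce micro-averaging, and all labels, scores, and per-repeat AUC values are hypothetical.

```python
import math

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_t(a, b):
    """t statistic and degrees of freedom for paired samples (tests mean(a - b) > 0)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance of the differences
    return mean / math.sqrt(var / n), n - 1

# Hypothetical predictions for five records (1 = good responder).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)

# Hypothetical AUCs from five repeated experiments.
union_aucs = [0.84, 0.82, 0.83, 0.85, 0.82]
clinical_aucs = [0.73, 0.74, 0.72, 0.74, 0.73]
t_stat, dof = paired_t(union_aucs, clinical_aucs)
```

The one-tailed p-value is the upper-tail probability of a t distribution with `dof` degrees of freedom at `t_stat`; in practice a statistics library would supply it, for example `scipy.stats.ttest_rel` with `alternative="greater"`.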
If any imaging feature set showed a statistically significant contribution (p < 0.001), the model was further validated on the external clinical dataset to evaluate its generalizability using the best machine-learning method. More specifically, all models trained in the training stage were kept fixed, with no further training or modification in the validation stage. The features to be tested were selected manually according to the feature-selection results of the training set, and the validation feature sets were built by extracting those features from the external set. The features generated from the external set were normalized using the MinMaxScaler fitted on the training set and input into the trained model to predict LCT results.

Records inclusion
In this study, we included 219 records from PPMI (Table 1).

Feature extraction
Four distinct feature sets were generated. The age-corrected regional gray matter intensity yielded 50 principal components from discriminative voxels through PCA. Subcortical texture features yielded 86 features from each of the 16 ROIs, ultimately reduced to 225 features after correlation-based feature exclusion. The morphological graph contributed 368 features, whereas the pre-trained model of MedicalNet extracted 13,824 features. There were no differences among repeated selections for all four feature extraction methods.

Feature selection
The feature selection steps yielded four distinct reduced feature sets. The age-corrected regional gray matter intensity retained only one of the 50 PCA features. Subcortical texture retained two features, both situated in the right thalamus, out of the 225 input features. For the morphological graph, 18 of the 368 features were selected. Of the 13,824 features extracted from MedicalNet, only 9 were selected. Detailed information on the selected features is presented in Table 2.

Model performance
Table 3 summarizes model performance on the test set. MedicalNet-extracted features consistently outperformed other feature sets across all three models (SVM Union 0.83 ± 0.01, Clinical 0.73 ± 0.01; XgBoost Union 0.80 ± 0.04, Clinical 0.74 ± 0.02; MLP Union 0.80 ± 0.03, Clinical 0.70 ± 0.07; p < 0.001). The best-performing union model, utilizing MedicalNet-extracted features, was SVM, with an AUC of 0.83 ± 0.01 on the test set. For subcortical texture features, only SVM displayed significant improvement (Union 0.79 ± 0.003, Clinical 0.73 ± 0.01, p < 0.001). The MLP exhibited a minor but not statistically significant enhancement from 0.70 ± 0.07 to 0.74 ± 0.08. The addition of texture features to XgBoost decreased the AUC from 0.74 ± 0.02 to 0.73 ± 0.03. Neither regional gray matter intensity features nor morphological network features significantly improved performance in any of the three models. The AUCs of the improved feature sets are shown in Fig. 3.

Feature visualization
Using MedicalNet as a feature extractor, we visualized the surviving features after feature selection. The up-sampled saliency map from the last convolution layer revealed key ROIs that contributed to classification. The saliency map and most significant cluster are shown in Fig. 5. This dominant cluster identified several anatomical regions, including the superior temporal gyrus, cingulate gyrus, thalamus, putamen, GPe, GPi, hippocampus, insula, RN, SN, pons, and VTA.

Discussion
In this project, we proposed a feature extraction method based on a pre-trained ResNet-based model. The features of this model outperformed previously published methods on both PPMI and external clinical datasets, demonstrating greater robustness and generalizability than clinical features. Our study also offers insights into the brain regions responsible for levodopa response prediction.
Multiple feature extraction methods have been developed to maximize the information extractable from T1-weighted MRI for LCT prediction. Although previous studies have demonstrated promising prediction performance using age-corrected regional gray matter intensity (accuracy 74%) and morphological graph (AUC 0.98) features, their conclusions remain uncertain owing to the small sample sizes and lack of test and external validation sets [12,13]. The study using subcortical ROI texture features (r² of 0.76) combined clinical features with T1-weighted images and had a relatively large sample size and an external validation set, although the imaging features were not evaluated separately [4]. Here, we developed a rigorous pipeline to re-evaluate previous methods using three feature combinations: clinical features alone, imaging features alone, and their union. Our results revealed that, among the previous methods, only the addition of subcortical texture features to the model significantly improved classification performance.
Although subcortical texture features showed predictive potential, we aimed to broaden our search for biomarkers beyond these regions and to achieve greater improvement. We modified MedicalNet to serve as a deep-learning feature extractor. The union model, incorporating MedicalNet-extracted features, outperformed all other methods across all three machine learning models on the test set (p < 0.001 for all). The saliency map, generated to visualize the selected features from MedicalNet, highlighted common subcortical ROIs (putamen, thalamus, GPi, GPe, RN, and SN) and additional ROIs (superior temporal gyrus, cingulate gyrus, hippocampus, insula, pons, and VTA). These findings potentially explain the superior performance of MedicalNet-extracted features over subcortical texture features. Gallagher et al [29] reported that subtle changes in anterior cingulate dopamine metabolism may contribute to dysexecutive behaviors in PD. Calabresi et al [30] proposed a link between the hippocampus and dopaminergic system changes in PD. Similarly, Faivre et al [31] suggested that VTA modulates motor and non-motor symptoms related to a partial loss of dopamine cells in PD. Halliday et al reported neuropathological changes in catecholamine cell groups in PD [32]. These findings suggest that the newly identified ROIs in our study may indeed be related to dopaminergic system changes in PD, explaining their contribution to LCT prediction. Although more related to cognitive impairment in PD, a D2 receptor loss was observed in the insula of PD patients, potentially affecting LCT results measured using MDS-UPDRS III [33]. For the superior temporal gyrus, no direct relationship with dopaminergic system changes in PD has been reported; however, its involvement with PD progression has been suggested [34].
When the external clinical set was tested in the same way, using only the two feature sets previously established as predictive on the test set, a marked decrease in performance was observed in the clinical and union models. This indicated potential bias in the clinical information of the external clinical set, which is common in the clinical environment. This drop in performance also called into question the generalizability of clinical information in real-world settings. However, the imaging model with MedicalNet outperformed all other models, with an AUC of 0.64 ± 0.03, demonstrating that the information extracted from objective T1-weighted MRI using MedicalNet was more robust and consistent than clinical information.
This study had some limitations. Although larger than that of several previous studies, our sample size was limited, which led to biased models that affected performance and precluded training a deep learning model for this specific task. Having only one retrospective external clinical set limited the ability to further evaluate the generalizability of the predictive features. Although data-driven ROIs were identified via the MedicalNet extractor and validated using an external clinical set, their implication in levodopa response prediction and PD progression remains unclear, necessitating more interpretable models or features to elucidate their pathological roles. Lastly, considering the poor generalizability of clinical information, real-world prediction models need to rely on imaging features exclusively. However, using T1-weighted MRI alone yielded an AUC of 0.64 in the external clinical set, which implies the potential value of imaging data.
In conclusion, T1-weighted MRI offers more robust information than general demographic and clinical features. However, it may not suffice for predicting levodopa response in clinical settings (AUC 0.64 ± 0.03). Therefore, to improve practical LCT prediction performance, future studies should explore advanced imaging for robust feature extraction. A previous study highlighted the utility of T2* images in 16 subcortical ROIs [4]. Subsequent studies could investigate the predictive potential of our newly identified brain regions using T2* or quantitative susceptibility mapping, which indicates the iron load [4], and integrate this information to generate a more robust and generalizable model for levodopa response prediction.

Fig. 2
Fig. 2 Study design. Two preprocessing methods were performed on T1w images. Four feature extraction methods were then applied to extract features from the preprocessed images. Three feature selection methods were used sequentially to select the most significant features for classification. Three machine learning models were trained on the training set and tested on the test set to predict the category of LCT result (good/bad responder). An external clinical dataset was also included to evaluate the generalizability of the model. The important features of the MedicalNet extractor were visualized. VBM, voxel-based morphometry; CAT12, computational anatomy toolbox; ANTs, advanced normalization tools; ROI, region of interest; PCA, principal component analysis; mRMR, minimum redundancy maximum relevance; LASSO, least absolute shrinkage and selection operator; RFE, recursive feature elimination; SVM, support vector machine; XgBoost, extreme gradient boosting; MLP, multi-layer perceptron

Fig. 3
Fig. 3 Model performance. A ROC curve of MedicalNet feature sets on the test set with SVM. B ROC curve of MedicalNet feature sets on the test set with XgBoost. C ROC curve of MedicalNet feature sets on the test set with MLP. D ROC curve of subcortical texture feature sets on the test set with SVM

Fig. 4 Fig. 5
Fig. 4 Performance comparison. A Box plots of models with significant improvement from the clinical set to the union set on the test set. B Box plot of the ROC-AUC distributions of different models using MedicalNet-extracted features on the external clinical set. Paired one-tailed t-test: ***, 1.00e-04 < p ≤ 1.00e-03; ****, p ≤ 1.00e-04

Table 1
Demographic and clinical information for datasets. PPMI, Parkinson's Progression Markers Initiative; LCT, levodopa challenge test; LEDD, levodopa equivalent daily dose
The training and test sets encompassed 173 records (mean age, 64 ± 9 years, 116 men) and 46 records (mean age, 62 ± 9 years, 28 men), respectively. The external clinical dataset from Ruijin Hospital included 217 records (mean age, 63 ± 9 years, 125 men), with 201 good and 16 bad responders. The demographic and clinical data of all the datasets are summarized in Table 1.

Table 2
Surviving features post-feature selection

Table 3
AUCs on the test set

Table 4
AUCs on the external clinical set using SVM